Linear Regression & Seaborn: Introduction

Do higher film budgets lead to more box office revenue? Let's find out whether there is a relationship, using movie budget and financial performance data that I scraped from the-numbers.com on May 1st, 2018.

Import Statements

Notebook Presentation

Read the Data

Explore and Clean the Data

Challenge: Answer these questions about the dataset:

  1. How many rows and columns does the dataset contain?
  2. Are there any NaN values present?
  3. Are there any duplicate rows?
  4. What are the data types of the columns?
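One way to approach these four questions is sketched below. The DataFrame name `data`, the `Movie_Title` column, and every value in the sample rows are made up for illustration; the real notebook would load the scraped file instead.

```python
import pandas as pd

# Made-up sample rows standing in for the scraped dataset (assumption).
data = pd.DataFrame({
    "Release_Date": ["8/2/1915", "12/18/2009", "12/18/2009"],
    "Movie_Title": ["Film A", "Film B", "Film B"],
    "USD_Production_Budget": ["$100,000", "$2,000,000", "$2,000,000"],
    "USD_Worldwide_Gross": ["$500,000", "$9,000,000", "$9,000,000"],
    "USD_Domestic_Gross": ["$400,000", "$3,000,000", "$3,000,000"],
})

n_rows, n_cols = data.shape              # 1. number of rows and columns
has_nan = data.isna().values.any()       # 2. True if any cell is NaN
n_duplicates = data.duplicated().sum()   # 3. rows identical to an earlier row
column_types = data.dtypes               # 4. dtype of every column

print(f"{n_rows} rows, {n_cols} columns")
print(f"Any NaN values? {has_nan}")
print(f"Duplicate rows: {n_duplicates}")
print(column_types)
```

`data.info()` gives a compact overview of the same information in one call.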

Data Type Conversions

Challenge: Convert the USD_Production_Budget, USD_Worldwide_Gross, and USD_Domestic_Gross columns to a numeric format by removing the $ signs and commas.

Note that domestic in this context refers to the United States.
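A minimal sketch of the currency clean-up, assuming the three columns hold strings such as "$1,500,000". The sample values and the DataFrame name `data` are invented for illustration.

```python
import pandas as pd

# Made-up sample values (assumption); only the column names come from the text.
data = pd.DataFrame({
    "USD_Production_Budget": ["$110,000", "$1,500,000"],
    "USD_Worldwide_Gross": ["$11,000,000", "$0"],
    "USD_Domestic_Gross": ["$10,000,000", "$0"],
})

money_columns = [
    "USD_Production_Budget",
    "USD_Worldwide_Gross",
    "USD_Domestic_Gross",
]

for col in money_columns:
    # Strip the literal characters; regex=False treats "$" as plain text.
    cleaned = (
        data[col]
        .astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    # Convert the cleaned strings to numbers.
    data[col] = pd.to_numeric(cleaned)

print(data.dtypes)
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.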

Challenge: Convert the Release_Date column to a Pandas Datetime type.
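The date conversion could look like the sketch below. The month/day/year string format and the sample dates are assumptions; adjust `format` to whatever the scraped file actually contains.

```python
import pandas as pd

# Made-up sample dates in an assumed month/day/year format.
data = pd.DataFrame({"Release_Date": ["8/2/1915", "12/18/2009"]})

# Parse the strings into pandas datetime values.
data["Release_Date"] = pd.to_datetime(data["Release_Date"], format="%m/%d/%Y")

print(data["Release_Date"].dtype)
```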

Descriptive Statistics

Challenge:

  1. What is the average production budget of the films in the data set?
  2. What is the average worldwide gross revenue of films?
  3. What were the minimums for worldwide and domestic revenue?
  4. Are the bottom 25% of films actually profitable or do they lose money?
  5. What are the highest production budget and highest worldwide gross revenue of any film?
  6. How much revenue did the lowest and highest budget films make?
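Most of these questions can be read off a single `.describe()` table, as in the rough sketch below. The numbers are invented purely to make the example runnable, and `data` is an assumed name for the cleaned DataFrame.

```python
import pandas as pd

# Made-up numeric data (assumption); column names come from the text.
data = pd.DataFrame({
    "USD_Production_Budget": [1_000_000, 5_000_000, 20_000_000, 100_000_000],
    "USD_Worldwide_Gross": [0, 3_000_000, 60_000_000, 400_000_000],
    "USD_Domestic_Gross": [0, 1_000_000, 25_000_000, 150_000_000],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = data.describe()

mean_budget = summary.loc["mean", "USD_Production_Budget"]
min_worldwide = summary.loc["min", "USD_Worldwide_Gross"]

# Bottom quartile: compare the 25% row of budget against the 25% row of revenue.
bottom_quartile = summary.loc["25%", ["USD_Production_Budget", "USD_Worldwide_Gross"]]

# Full rows for the lowest and highest budget films
lowest_budget_film = data.loc[data["USD_Production_Budget"].idxmin()]
highest_budget_film = data.loc[data["USD_Production_Budget"].idxmax()]

print(summary)
print(bottom_quartile)
```

If the 25% value for revenue sits below the 25% value for budget, that hints that the bottom quarter of films loses money, although quartiles of two columns do not necessarily describe the same films.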

Investigating the Zero Revenue Films

Challenge: How many films grossed $0 domestically (i.e., in the United States)? What were the highest budget films that grossed nothing?

Challenge: How many films grossed $0 worldwide? What are the highest budget films that had no revenue internationally?
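Both zero-revenue questions follow the same pattern: filter with a boolean condition, then sort by budget. The sketch below uses made-up films; `data` and `Movie_Title` are assumed names.

```python
import pandas as pd

# Made-up films (assumption), amounts in USD.
data = pd.DataFrame({
    "Movie_Title": ["Film A", "Film B", "Film C", "Film D"],
    "USD_Production_Budget": [10_000_000, 30_000_000, 80_000_000, 2_000_000],
    "USD_Worldwide_Gross": [50_000_000, 5_000_000, 0, 0],
    "USD_Domestic_Gross": [20_000_000, 0, 0, 0],
})

# Films with no US revenue, most expensive first
zero_domestic = data[data["USD_Domestic_Gross"] == 0]
zero_domestic = zero_domestic.sort_values("USD_Production_Budget", ascending=False)

# Films with no revenue anywhere, most expensive first
zero_worldwide = data[data["USD_Worldwide_Gross"] == 0]
zero_worldwide = zero_worldwide.sort_values("USD_Production_Budget", ascending=False)

print(f"Films grossing $0 domestically: {len(zero_domestic)}")
print(f"Films grossing $0 worldwide: {len(zero_worldwide)}")
```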

Filtering on Multiple Conditions

Challenge: Use the .query() method to accomplish the same thing. Create a subset for international releases that had some worldwide gross revenue, but made zero revenue in the United States.

Hint: This time you'll have to use the and keyword.
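For comparison, here is a sketch of the same filter written twice, once with a boolean mask and once with `.query()`. The sample films and the names `data` and `international_releases` are assumptions.

```python
import pandas as pd

# Made-up films (assumption), amounts in USD.
data = pd.DataFrame({
    "Movie_Title": ["Film A", "Film B", "Film C"],
    "USD_Worldwide_Gross": [50_000_000, 5_000_000, 0],
    "USD_Domestic_Gross": [20_000_000, 0, 0],
})

# Boolean mask version: conditions in parentheses, combined with the & operator
mask_version = data[
    (data["USD_Domestic_Gross"] == 0) & (data["USD_Worldwide_Gross"] != 0)
]

# .query() version: column names inside a string, combined with the and keyword
international_releases = data.query(
    "USD_Domestic_Gross == 0 and USD_Worldwide_Gross != 0"
)

print(international_releases)
```

The two versions select the same rows; `.query()` tends to read more cleanly once several conditions are involved.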

Unreleased Films

Challenge:

Films that Lost Money

Challenge: What is the percentage of films where the production costs exceeded the worldwide gross revenue? 
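The percentage is the number of money-losing films divided by the total number of films. A small sketch with invented figures:

```python
import pandas as pd

# Made-up films (assumption), amounts in USD.
data_clean = pd.DataFrame({
    "USD_Production_Budget": [10_000_000, 30_000_000, 80_000_000, 2_000_000],
    "USD_Worldwide_Gross": [50_000_000, 5_000_000, 90_000_000, 1_000_000],
})

# Films whose production budget exceeded their worldwide gross
money_losing = data_clean.query("USD_Production_Budget > USD_Worldwide_Gross")

pct_losing = len(money_losing) / len(data_clean) * 100
print(f"{pct_losing:.1f}% of films lost money")
```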

Seaborn for Data Viz: Bubble Charts

Plotting Movie Releases over Time

Challenge: Try to create the following Bubble Chart.

Converting Years to Decades Trick

Challenge: Create a column in data_clean that has the decade of the release.

Here's how:

  1. Create a DatetimeIndex object from the Release_Date column.
  2. Grab all the years from the DatetimeIndex object using the .year property.
  3. Use floor division // to convert the year data to the decades of the films.
  4. Add the decades as a Decade column to the data_clean DataFrame.
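The four steps above can be sketched as follows. The release dates are made up; the trick is that floor division by 10 drops the last digit of the year, and multiplying by 10 puts a zero back.

```python
import pandas as pd

# Made-up release dates (assumption).
data_clean = pd.DataFrame({
    "Release_Date": pd.to_datetime(
        ["1915-08-02", "1969-12-31", "1970-01-01", "2009-12-18"]
    )
})

dt_index = pd.DatetimeIndex(data_clean["Release_Date"])  # step 1
years = dt_index.year                                    # step 2
decades = (years // 10) * 10                             # step 3: 1969 -> 196 -> 1960
data_clean["Decade"] = decades                           # step 4

print(data_clean)
```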

Separate the "Old" (1969 and earlier) and "New" (1970s onwards) Films

Challenge: Create two new DataFrames: old_films and new_films.
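One possible split uses the Decade column, so that everything up to the 1960s counts as old. The sample rows and the `Movie_Title` column are assumptions.

```python
import pandas as pd

# Made-up films with a precomputed Decade column (assumption).
data_clean = pd.DataFrame({
    "Movie_Title": ["Film A", "Film B", "Film C", "Film D"],
    "Decade": [1910, 1960, 1970, 2000],
})

old_films = data_clean[data_clean["Decade"] <= 1960]  # released in 1969 or earlier
new_films = data_clean[data_clean["Decade"] > 1960]   # released in 1970 or later

print(f"Old films: {len(old_films)}, new films: {len(new_films)}")
```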

Seaborn Regression Plots

Challenge: Use Seaborn's .regplot() to show the scatter plot and linear regression line against the new_films.
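A bare-bones version of such a plot might look like this. The data points are invented, and the styling arguments (`alpha`, the line colour) are just one possible choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up films (assumption), amounts in USD.
new_films = pd.DataFrame({
    "USD_Production_Budget": [10e6, 40e6, 90e6, 150e6, 250e6],
    "USD_Worldwide_Gross": [25e6, 60e6, 300e6, 380e6, 900e6],
})

fig, ax = plt.subplots(figsize=(8, 4))

# regplot draws the scatter points plus a fitted linear regression line
sns.regplot(
    data=new_films,
    x="USD_Production_Budget",
    y="USD_Worldwide_Gross",
    scatter_kws={"alpha": 0.4},     # semi-transparent points
    line_kws={"color": "black"},    # colour of the regression line
    ax=ax,
)
```

In a notebook the figure renders inline when the cell finishes; in a script, call `plt.show()` afterwards.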

Style the chart

Interpret the chart

Run Your Own Regression with scikit-learn

$$ \widehat{REVENUE} = \theta_0 + \theta_1 BUDGET $$

Challenge: Run a linear regression for the old_films. Calculate the intercept, slope and r-squared. How much of the variance in movie revenue does the linear model explain in this case?
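A sketch of the regression with scikit-learn's `LinearRegression` is shown below. The four data points are made up (think of them as millions of USD), so the resulting numbers say nothing about the real films.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data in millions of USD (assumption).
old_films = pd.DataFrame({
    "USD_Production_Budget": [1.0, 2.0, 3.0, 4.0],
    "USD_Worldwide_Gross": [3.0, 5.0, 4.0, 8.0],
})

# Features must be 2-dimensional (a DataFrame); the target can be a Series.
X = old_films[["USD_Production_Budget"]]
y = old_films["USD_Worldwide_Gross"]

regression = LinearRegression()
regression.fit(X, y)

intercept = regression.intercept_   # theta_0
slope = regression.coef_[0]         # theta_1
r_squared = regression.score(X, y)  # share of variance explained by the model

print(f"Intercept: {intercept:.2f}")
print(f"Slope: {slope:.2f}")
print(f"R-squared: {r_squared:.2f}")
```

The r-squared value answers the last question directly: a value of 0.7, for example, means the model explains 70% of the variance in revenue.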

Use Your Model to Make a Prediction

We just estimated the slope and intercept! Remember that our Linear Model has the following form:

$$ \widehat{REVENUE} = \theta_0 + \theta_1 BUDGET $$

Challenge: How much global revenue does our model estimate for a film with a budget of $350 million?
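The prediction is just the linear model evaluated at a budget of 350 million. The sketch below fits a model on made-up, perfectly linear data, so its coefficients and its estimate are illustrative only and are not the values the real dataset would produce.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data (assumption): revenue = 3 * budget - 5,000,000 exactly.
films = pd.DataFrame({
    "USD_Production_Budget": [10e6, 50e6, 100e6, 200e6],
    "USD_Worldwide_Gross": [25e6, 145e6, 295e6, 595e6],
})

regression = LinearRegression()
regression.fit(films[["USD_Production_Budget"]], films["USD_Worldwide_Gross"])

budget = 350_000_000

# Option 1: plug the budget into theta_0 + theta_1 * BUDGET by hand
revenue_estimate = regression.intercept_ + regression.coef_[0] * budget

# Option 2: let the fitted model do the same calculation
revenue_via_predict = regression.predict(
    pd.DataFrame({"USD_Production_Budget": [budget]})
)[0]

print(f"Estimated revenue for a $350 million film: ${revenue_estimate:,.0f}")
```

Note that a budget this large may lie outside the range of budgets the model was fitted on, in which case the estimate is an extrapolation and should be read with caution.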